This is my final project for the GIS class in the Quantitative Methods in the Social Sciences program at Columbia University. For my project, I decided to explore the relationship between tweeting and gentrification.

I started the analysis with one independent variable (rate of change in rents between 2010 and 2015), one dependent variable (number of tweets in an area) and two control variables (rate of change in education level and in income in an area), but ended up dropping the control variable on education. Initially, I also performed my analysis at the level of sub-borough areas. This proved problematic, as there are only a few dozen sub-borough areas.

In this final version, I use block groups from the U.S. Census Bureau TIGER/Line files and associated data on rents and income. The tweets were scraped during one day in November. While this only provided a few thousand tweets, the results from regressing the data at the block group level are encouraging. It seems like it would be worth exploring the connection of tweets to more established gentrification measurements with a bigger sample of tweets.

The hypothesis is that the number of tweets in an area is positively related to the change in rents in that area. The null hypothesis is that no such relationship exists.

Mapping

First, I want to explore what the data look like when mapped to block groups instead of sub-borough areas. For this, I preprocess the tweets in a few steps, normalizing them for the size of each census block group by dividing the number of tweets in a block group by the area of the block group and then multiplying this result by the mean of the block group areas. I also calculate the rate of change in income and rent between 2010 and 2015. This produces the following set of Leaflet maps:
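As a side note on the preprocessing behind the maps, the normalization and rate-of-change steps can be sketched as follows. This is a minimal Python illustration, not the R code used for the project. The function names and the toy data are hypothetical, and the rate of change is assumed to be the relative change (2015 value minus 2010 value, divided by the 2010 value), which the text above does not spell out.

```python
from statistics import mean


def normalize_tweet_counts(tweet_counts, areas):
    """Scale each block group's tweet count to the mean block-group area.

    count / area gives a density; multiplying by the mean area turns it
    back into "tweets per average-sized block group".
    """
    mean_area = mean(areas)
    return [count / area * mean_area for count, area in zip(tweet_counts, areas)]


def rate_of_change(value_2010, value_2015):
    """Relative change between the two years (assumed definition)."""
    return (value_2015 - value_2010) / value_2010


# Toy data (hypothetical): three block groups, areas in arbitrary units.
tweet_counts = [4, 4, 0]
areas = [1.0, 2.0, 3.0]
normalized = normalize_tweet_counts(tweet_counts, areas)

# A rent going from 1000 to 1250 is a rate of change of 0.25.
example_rent_change = rate_of_change(1000, 1250)
```

With this scaling, two block groups with the same raw count end up with different values when their areas differ, which is the point of the normalization.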

Regression models

The takeaway from the maps is that there might be a relationship between tweeting and the rates of change of income and rent, as some areas, such as Williamsburg and Bed-Stuy in Brooklyn, appear to be in the upper parts of the value distributions for all three variables. An Ordinary Least Squares regression suggests that some kind of relationship might exist, although the model as a whole explains less than 4% of the variance in rent change (R-squared of 0.038). Of the two predictors, the rate of change of income (the control variable) explains more of the variation in the rate of change of rents than tweets do. Nonetheless, there is a weak but statistically significant positive relationship between the number of tweets in a block group and the rate of change of rents during the past five years.

## 
## Call:
## lm(formula = rent_roc ~ inc_roc + twtn, data = rent_income20102015_toGEODA@data, 
##     na.action = na.exclude)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.8227 -0.1661 -0.0381  0.1048 13.4941 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.1857012  0.0057533   32.28  < 2e-16 ***
## inc_roc     0.1567649  0.0111708   14.03  < 2e-16 ***
## twtn        0.0033057  0.0009928    3.33 0.000875 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3893 on 5442 degrees of freedom
##   (840 observations deleted due to missingness)
## Multiple R-squared:  0.03757,    Adjusted R-squared:  0.03721 
## F-statistic: 106.2 on 2 and 5442 DF,  p-value: < 2.2e-16
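To make the summary above easier to read, here is a small Python sketch (not part of the original R analysis) that recomputes the t-values and the adjusted R-squared from the numbers printed in the output. The variable names are my own. The only assumptions are the standard formulas: t = estimate / standard error, and adjusted R-squared = 1 - (1 - R-squared) * (n - 1) / residual degrees of freedom.

```python
# Estimates and standard errors copied from the lm() summary above.
coefficients = {
    "(Intercept)": (0.1857012, 0.0057533),
    "inc_roc": (0.1567649, 0.0111708),
    "twtn": (0.0033057, 0.0009928),
}

# t-value: how many standard errors the estimate lies from zero.
t_values = {name: est / se for name, (est, se) in coefficients.items()}

# Adjusted R-squared penalizes R-squared for the number of predictors.
r_squared = 0.03757
residual_df = 5442
n_predictors = 2
n_obs = residual_df + n_predictors + 1  # observations actually used in the fit
adjusted_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / residual_df
```

The recomputed values agree with the printed ones up to rounding, which shows that the tweet coefficient is small in size but estimated precisely enough to be distinguishable from zero.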

I then proceeded to create a weights matrix, which can in turn be used to compute the global Moran’s I. This statistic shows how, on average, the relationship between a given polygon and its neighbours differs from what one would expect under spatial randomness.
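To illustrate what the weights matrix and the global Moran’s I do, here is a minimal Python sketch on a toy example. It is not the R code used in the project. The neighbour structure (four polygons in a row) and the values are hypothetical, and row-standardized weights are an assumption on my part.

```python
def row_standardize(neighbours):
    """Turn a neighbour list into weights where each row sums to 1."""
    return {
        i: {j: 1.0 / len(nbrs) for j in nbrs} if nbrs else {}
        for i, nbrs in neighbours.items()
    }


def morans_i(values, weights):
    """Global Moran's I: (n / S0) * sum_ij w_ij * d_i * d_j / sum_i d_i^2."""
    n = len(values)
    mean_value = sum(values) / n
    dev = [v - mean_value for v in values]
    s0 = sum(w for row in weights.values() for w in row.values())
    cross = sum(
        w * dev[i] * dev[j] for i, row in weights.items() for j, w in row.items()
    )
    return (n / s0) * cross / sum(d * d for d in dev)


# Hypothetical layout: four polygons in a row, 0 - 1 - 2 - 3.
neighbours = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
weights = row_standardize(neighbours)

clustered = morans_i([1.0, 1.0, 5.0, 5.0], weights)    # similar values adjacent
alternating = morans_i([1.0, 5.0, 1.0, 5.0], weights)  # dissimilar values adjacent
```

In this toy layout the clustered pattern gives a positive value and the alternating pattern a negative one. Values near zero correspond to a pattern that is close to spatially random.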

For the basic OLS model, the global value of Moran’s I is 0.02017342, which is statistically significant at the 5% level (p-value of 0.01403). Because the value is so close to zero, the spatial autocorrelation in the residuals is very weak, even though it is statistically detectable: the residuals are close to spatially random and there is no clear clustering of similar values. The reason for this might be my limited Twitter data, with values of zero for the majority of block groups. This also caused a problem when looking for quantiles in the data, as more than one quantile was simply zero. As the overall variance in the Twitter data is low, there is little room for spatial autocorrelation patterns to emerge.
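The standard deviate and p-value in the Moran’s I output below can be reproduced from the observed value, the expectation and the variance that the same output reports. The following Python sketch (my own check, not part of the original R workflow) assumes a normal approximation and a two-sided test, as stated in the output.

```python
import math

# Values copied from the "Global Moran I for regression residuals" output.
observed_i = 2.017342e-02
expected_i = -2.579993e-04
variance_i = 6.918066e-05

# Standard deviate: distance from the expectation in standard deviations.
z_score = (observed_i - expected_i) / math.sqrt(variance_i)

# Two-sided p-value under a normal approximation: 2 * (1 - Phi(|z|)).
p_value = math.erfc(abs(z_score) / math.sqrt(2))
```

The check makes the interpretation concrete: the statistic is about 2.5 standard deviations from its expectation, which is enough for significance with several thousand polygons even though the value of Moran’s I itself is tiny.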

## 
##  Global Moran I for regression residuals
## 
## data:  
## model: lm(formula = rent_roc ~ inc_roc + twtn, data =
## rent_income20102015_toGEODA@data)
## weights: W
## 
## Moran I statistic standard deviate = 2.4564, p-value = 0.01403
## alternative hypothesis: two.sided
## sample estimates:
## Observed Moran I      Expectation         Variance 
##     2.017342e-02    -2.579993e-04     6.918066e-05
## 
##  Lagrange multiplier diagnostics for spatial dependence
## 
## data:  
## model: lm(formula = rent_roc ~ inc_roc + twtn, data =
## rent_income20102015_toGEODA@data)
## weights: W
## 
## LMerr = 5.887, df = 1, p-value = 0.01525
## 
## 
##  Lagrange multiplier diagnostics for spatial dependence
## 
## data:  
## model: lm(formula = rent_roc ~ inc_roc + twtn, data =
## rent_income20102015_toGEODA@data)
## weights: W
## 
## LMlag = 10.714, df = 1, p-value = 0.001063
## 
## 
##  Lagrange multiplier diagnostics for spatial dependence
## 
## data:  
## model: lm(formula = rent_roc ~ inc_roc + twtn, data =
## rent_income20102015_toGEODA@data)
## weights: W
## 
## RLMerr = 27.7, df = 1, p-value = 1.416e-07
## 
## 
##  Lagrange multiplier diagnostics for spatial dependence
## 
## data:  
## model: lm(formula = rent_roc ~ inc_roc + twtn, data =
## rent_income20102015_toGEODA@data)
## weights: W
## 
## RLMlag = 32.527, df = 1, p-value = 1.175e-08
## 
## 
##  Lagrange multiplier diagnostics for spatial dependence
## 
## data:  
## model: lm(formula = rent_roc ~ inc_roc + twtn, data =
## rent_income20102015_toGEODA@data)
## weights: W
## 
## SARMA = 38.414, df = 2, p-value = 4.555e-09
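One common way to read these Lagrange multiplier diagnostics is a stepwise rule: look at LMerr and LMlag first, and if both are significant, compare their robust versions. The Python sketch below encodes that rule of thumb. It is my own illustration rather than part of the original analysis. The function name, the 0.05 threshold and the tie-break on the larger robust statistic are assumptions, and the rule is a heuristic, not a formal test.

```python
# Statistics and p-values copied from the Lagrange multiplier output above.
lm_tests = {
    "LMerr": (5.887, 0.01525),
    "LMlag": (10.714, 0.001063),
    "RLMerr": (27.7, 1.416e-07),
    "RLMlag": (32.527, 1.175e-08),
}


def suggest_model(tests, alpha=0.05):
    """Suggest 'ols', 'lag' or 'error' from the LM diagnostics (heuristic)."""
    err_sig = tests["LMerr"][1] < alpha
    lag_sig = tests["LMlag"][1] < alpha
    if not err_sig and not lag_sig:
        return "ols"
    if lag_sig and not err_sig:
        return "lag"
    if err_sig and not lag_sig:
        return "error"
    # Both plain tests are significant: fall back on the robust versions.
    robust_err_sig = tests["RLMerr"][1] < alpha
    robust_lag_sig = tests["RLMlag"][1] < alpha
    if robust_lag_sig and not robust_err_sig:
        return "lag"
    if robust_err_sig and not robust_lag_sig:
        return "error"
    # Still undecided: prefer the larger robust statistic (assumed tie-break).
    return "lag" if tests["RLMlag"][0] >= tests["RLMerr"][0] else "error"


suggestion = suggest_model(lm_tests)
```

With the values above, all four tests are significant and the robust lag statistic is the larger of the two robust statistics, so the heuristic points towards the spatial lag model.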

The spatial lag and spatial error models account for spatial dependence in the linear model in two different ways. The spatial lag model adds the spatially lagged dependent variable (the spatial weights matrix multiplied by the dependent variable) as a regressor; when this kind of dependence is present but left out of the model, the OLS estimates are biased and inconsistent. I initially tried to fit these models in R, but RStudio kept crashing when running either the spatial lag or the spatial error model. I thus ran them in GeoDa, saving the regression coefficients and related parameters as images and importing the residuals as a shapefile.

For my spatial lag model, the results were the following:

Summary for the spatial lag regression model

Here, the z-value is the coefficient estimate divided by its standard error, that is, the distance of the estimate from zero measured in standard errors. A high absolute z-value combined with a low p-value indicates that the coefficient is unlikely to be zero. Income and tweets have low p-values and fairly or very high z-values. The coefficient for income is fairly high and positive, while the coefficient for tweets is positive but extremely small.

The spatial error model, on the other hand, operates by including the weights matrix in the error term. The z-value, p-value and coefficients are similar to the lag model, but slightly smaller. The spatial lag model seems to be a better fit, which is further verified when we look at the residual mean square error below.
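For reference, the two specifications are usually written as below. The notation is the conventional one and does not come from the GeoDa output: I take y to be the vector of rent changes, X the matrix holding the intercept, income change and tweets, W the spatial weights matrix, and rho and lambda the spatial autoregressive parameters.

```latex
% Spatial lag model: the weights matrix multiplies the dependent variable.
y = \rho W y + X \beta + \varepsilon

% Spatial error model: the weights matrix enters through the error term.
y = X \beta + u, \qquad u = \lambda W u + \varepsilon
```

In both cases, setting the spatial parameter to zero gives back the ordinary linear model y = X beta + epsilon.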

Summary for the spatial lag regression model

Finally, the residual maps show the difference between the observed and the fitted data. The smaller the residuals, the better the fit. The residual mean squared error is one measure of the quality of the fit. For OLS, it is 0.3892337, for the spatial lag model 0.3843643 and for the spatial error model 0.3843643. By this criterion, the best fit is provided by the spatial lag model, although the differences are negligible. In the spatial lag model (and in the two other models), tweets have a statistically significant positive relationship to the rate of change in rent between 2010 and 2015.
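To show how such an error measure is obtained and how small the difference between the models is, here is a short Python sketch. It is an illustration only. The function name and the toy data are hypothetical, and I assume the measure is the square root of the residual sum of squares divided by the residual degrees of freedom, which GeoDa may define slightly differently.

```python
import math


def residual_standard_error(observed, fitted, n_parameters):
    """sqrt(RSS / (n - p)): typical size of a residual (assumed definition)."""
    residuals = [o - f for o, f in zip(observed, fitted)]
    rss = sum(r * r for r in residuals)
    return math.sqrt(rss / (len(residuals) - n_parameters))


# Toy data (hypothetical): a perfect fit except for the last observation.
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
fitted = [1.0, 2.0, 3.0, 4.0, 7.0]
toy_error = residual_standard_error(observed, fitted, n_parameters=1)

# Relative improvement of the spatial lag model over OLS, using the
# values reported in the text above.
ols_error = 0.3892337
lag_error = 0.3843643
relative_improvement = (ols_error - lag_error) / ols_error
```

The relative improvement comes out at a little over one percent, which is in line with calling the differences negligible.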

The size of the coefficient is small, but a larger sample of tweets might increase this value. Hence, we can reject the null hypothesis that no relationship exists: there is an indication of a weak relationship that would be worth exploring in further research.

Residual map for OLS:

Residual map for spatial lag: